{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 2: Faster and Cheaper Queries with Table Partitions and Clustering\n",
    "\n",
    "### Learning Objectives\n",
    "- Create a SQL query to analyze sales from our baseline table\n",
    "- Analyze the query execution plan for performance optimization opportunities\n",
    "- Find out how much data was discarded in full table scans\n",
    "- Analyze the same query against a partitioned table\n",
    "- Learn how to create partitioned tables with SQL DDL\n",
    "- Create partitioned tables for the entire dataset\n",
    "- Run benchmark queries and compare performance to our baseline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyze current architecture for partitions\n",
    "\n",
    "Let's find the largest table and see the current architecture.\n",
    "\n",
    "From a previous lab we created the below query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>dataset_id</th>\n",
       "      <th>table_id</th>\n",
       "      <th>size_gb</th>\n",
       "      <th>creation_time</th>\n",
       "      <th>last_modified_time</th>\n",
       "      <th>row_count</th>\n",
       "      <th>type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>tpcds_2t_baseline</td>\n",
       "      <td>store_sales</td>\n",
       "      <td>1545.13</td>\n",
       "      <td>2019-10-13 19:15:03.190000+00:00</td>\n",
       "      <td>2019-10-13 19:15:03.190000+00:00</td>\n",
       "      <td>5762820700</td>\n",
       "      <td>table</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          dataset_id     table_id  size_gb                    creation_time  \\\n",
       "0  tpcds_2t_baseline  store_sales  1545.13 2019-10-13 19:15:03.190000+00:00   \n",
       "\n",
       "                last_modified_time   row_count   type  \n",
       "0 2019-10-13 19:15:03.190000+00:00  5762820700  table  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT \n",
    "  dataset_id,\n",
    "  table_id,\n",
    "  -- Convert bytes to GB.\n",
    "  ROUND(size_bytes/pow(10,9),2) as size_gb,\n",
    "  -- Convert UNIX EPOCH to a timestamp.\n",
    "  TIMESTAMP_MILLIS(creation_time) AS creation_time,\n",
    "  TIMESTAMP_MILLIS(last_modified_time) as last_modified_time,\n",
    "  row_count,\n",
    "  CASE \n",
    "    WHEN type = 1 THEN 'table'\n",
    "    WHEN type = 2 THEN 'view'\n",
    "  ELSE NULL\n",
    "  END AS type\n",
    "FROM\n",
    "  `qwiklabs-resources.tpcds_2t_baseline.__TABLES__`\n",
    "ORDER BY size_gb DESC\n",
    "LIMIT 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our `store_sales` table is 1,545 GB and 5.7 Billion rows. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a standard sales report \n",
    "\n",
    "Your manager has asked you to query the existing data warehouse tables and build a report that shows:\n",
    "- the top 10 sales `ss_net_paid` for all sales on or after `2000-01-01`\n",
    "- include the name of the product\n",
    "- include the name and email of the customer and whether they are a preferred customer\n",
    "- exclude customers with a NULL `ss_customer_sk`\n",
    "- include the date and time of the order as a formatted timestamp\n",
    "\n",
    "__Note:__ Develop the query in the [BigQuery console](https://console.cloud.google.com/bigquery/) so you will be able to (1) have the benefit of the query validator as you type and (2) you can view the Execution Plan after your query runs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --verbose\n",
    "SELECT  \n",
    "  PARSE_TIMESTAMP (\n",
    "  \n",
    "  \"%Y-%m-%d %T %p\",\n",
    "  \n",
    "  CONCAT(\n",
    "    CAST(d_date AS STRING),\n",
    "    ' ',\n",
    "    CAST(t_hour AS STRING),\n",
    "    ':',\n",
    "    CAST(t_minute AS STRING),\n",
    "    ':',\n",
    "    CAST(t_second AS STRING),\n",
    "    ' ',t_am_pm)\n",
    "    \n",
    "    ,\"America/Los_Angeles\")\n",
    "    \n",
    "    AS timestamp,\n",
    "  s.ss_item_sk,\n",
    "  i.i_product_name,\n",
    "  s.ss_customer_sk,\n",
    "  c.c_first_name,\n",
    "  c.c_last_name,\n",
    "  c.c_email_address,\n",
    "  c.c_preferred_cust_flag,\n",
    "  s.ss_quantity,\n",
    "  s.ss_net_paid\n",
    "FROM\n",
    "  `qwiklabs-resources.tpcds_2t_baseline.store_sales` AS s\n",
    "  JOIN \n",
    "  `qwiklabs-resources.tpcds_2t_baseline.date_dim` AS d\n",
    "  ON s.ss_sold_date_sk = d.d_date_sk\n",
    "  \n",
    "  JOIN\n",
    "  `qwiklabs-resources.tpcds_2t_baseline.time_dim` AS t\n",
    "  ON s.ss_sold_time_sk = t.t_time_sk\n",
    "  \n",
    "  JOIN\n",
    "  `qwiklabs-resources.tpcds_2t_baseline.item` AS i\n",
    "  ON s.ss_item_sk = i.i_item_sk\n",
    "  \n",
    "  JOIN \n",
    "  `qwiklabs-resources.tpcds_2t_baseline.customer` AS c\n",
    "  ON s.ss_customer_sk = c. c_customer_sk\n",
    "  \n",
    "WHERE d_date >= '2000-01-01' AND ss_customer_sk IS NOT NULL\n",
    "ORDER BY ss_net_paid DESC\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This simple report took 40+ seconds to execute and processed over 200+ GB of data. Let's see where we can improve."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gaining insight from the Query Execution Details (part 1: high level stats)\n",
    "\n",
    "Learning how BigQuery processes your query under-the-hood is critical to understanding where you can improve performance. \n",
    "\n",
    "After you executed the previous query, in the BigQuery console click on  __Execution details__\n",
    "\n",
    "<img src=\"img/bq-exec-details-ui.png\" alt=\"BigQuery Execution Plan\" style=\"border: 2px solid #eee; width: 700px; float: left;\"/>\n",
    "\n",
    "<p style=\"clear: both; padding: 20px 0;\">\n",
    "Your query plan should be largely similar to ours below. Scan through the execution statistics and answer the questions that follow.\n",
    "</p>\n",
    "\n",
    "<img src=\"img/bq-exec-plan-1.png\" alt=\"BigQuery Execution Plan\" style=\"border: 2px solid #eee; width: 700px; float: left;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Slot time\n",
    "\n",
    "As you can see above, your query took `27 seconds` to process 5.7 Billion rows. So what does the `10hr 35min` slot time metric mean?\n",
    "\n",
    "Recall from our discussion in Lab 1 that inside the BigQuery service are lots of virtual machines that massively process your data and query logic in parallel. These workers, or \"slots\", work together to process a single query job really quickly.\n",
    "\n",
    "The BigQuery engine fully-manages the task of taking your query and farming it out to a workers who get the raw data and process the work to be done. \n",
    "\n",
    "So say we had 30 minutes of slot time or 1800 seconds. If the query took 27 seconds in total to run, \n",
    "but it was 36000 seconds (or 10 hours) worth of work, how many workers at minimum worked on it? \n",
    "36000/27 = 1,333\n",
    "\n",
    "And that's assuming each worker instantly had all the data it needed (no shuffling of data between workers) and was at full capacity for all 27 seconds!\n",
    "\n",
    "The worker [quota](https://cloud.google.com/bigquery/quotas#queries) for an on-demand query is 2,000 slots at one time so we want to find ways we can optimize and reduce the resources consumed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bytes shuffled\n",
    "\n",
    "We had `382 GB` of data shuffled. What does that mean? \n",
    "\n",
    "First let's explore the architecture of BigQuery and how it can process PB+ datasets in seconds. The BigQuery team explains how the managed service is setup in this [detailed blog post](https://cloud.google.com/blog/products/gcp/bigquery-under-the-hood) where the below diagram is sourced:\n",
    "\n",
    "![img/bq-under-hood.png](img/bq-under-hood.png)\n",
    "\n",
    "The actual Google services that the engine uses are\n",
    "- Dremel (the execution engine)\n",
    "- Jupiter (the petabit scale Google datacenter network)\n",
    "- Colossus (distributed clusters of storage)\n",
    "\n",
    "#### Storage\n",
    "Our `store_sales` table (1,545 GB and 5.7 Billion rows) isn't stored on just one server. In fact, BigQuery [compresses and stores each column of the data](https://cloud.google.com/blog/products/gcp/inside-capacitor-bigquerys-next-generation-columnar-storage-format) and stores pieces of it across many commodity servers\n",
    "\n",
    "#### Compute\n",
    "When it comes time to process a query like `top sales after the year 2000`, BigQuery starts up a fleet of workers to grab and process pieces of data. Since most of the time no single worker has a complete picture of all 5.7 Billon rows, they need to communicate with each other by passing data back and forth. This fast in-memory data `shuffling` or `repartitioning` process can be time and resource intensive. \n",
    "\n",
    "Below is an example diagram that shows the work-shuffle-work process for each worker:\n",
    "\n",
    "![img/bq-shuffle.png](img/bq-shuffle.png)\n",
    "\n",
    "As you can see in the Execution details, workers have a variety of tasks (waiting for data, reading it, performing computations, and writing data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gaining insight from the Query Execution Details (part 2: repartitioning / shuffling)\n",
    "\n",
    "Here is the next part of our execution plan after the data is read from disk. Note the longer blue bars indicate more time spent by workers. Generally the average worker (avg) and the slowest worker (max) are aligned unless your dataset is unbalanced and skewed heavily to a few values (hotspots). \n",
    "\n",
    "Here you can see time spent joining against other datasets and then many repartitions afterward to shuffle the billions of rows across workers for processing. You'll note that the input and output row counts for repartitions for the same as they are purely a shuffling effort across workers.\n",
    "\n",
    "<img src=\"img/bq-exec-plan-2.png\" alt=\"BigQuery Execution Plan\" style=\"border: 2px solid #eee; width: 700px; float: left;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What triggers repartitions? \n",
    "\n",
    "The performance-expensive operations in our query were:\n",
    "1. looking at every record and comparing to see if it was before or after `2000-01-01`\n",
    "2. sorting large volumes of data by timestamp\n",
    "3. computing a calculated field \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### How much data was unused?\n",
    "\n",
    "BigQuery had to scan and compare all records in the dataset to see if it matched our date condition. What percent of records were ultimately thrown away (pre-2000)?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>all_records</th>\n",
       "      <th>after_2000</th>\n",
       "      <th>before_2000</th>\n",
       "      <th>percent_dataset_used</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>5,503,485,473</td>\n",
       "      <td>3,314,029,432</td>\n",
       "      <td>2,189,456,041</td>\n",
       "      <td>0.6022</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     all_records     after_2000    before_2000  percent_dataset_used\n",
       "0  5,503,485,473  3,314,029,432  2,189,456,041                0.6022"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "\n",
    "WITH stats AS (\n",
    "\n",
    "SELECT \n",
    "  COUNT(*) AS all_records,\n",
    "  COUNTIF(d_date >= '2000-01-01') AS after_2000,\n",
    "  COUNTIF(d_date < '2000-01-01') AS before_2000\n",
    "  \n",
    "FROM\n",
    "  `qwiklabs-resources.tpcds_2t_baseline.store_sales` AS s\n",
    "  JOIN \n",
    "  `qwiklabs-resources.tpcds_2t_baseline.date_dim` AS d\n",
    "  ON s.ss_sold_date_sk = d.d_date_sk\n",
    ")\n",
    "\n",
    "SELECT\n",
    "  format(\"%'d\",all_records) AS all_records,\n",
    "  format(\"%'d\",after_2000) AS after_2000,\n",
    "  format(\"%'d\",before_2000) AS before_2000,\n",
    "  ROUND(after_2000 / all_records,4) AS percent_dataset_used\n",
    "FROM stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We only ended up using 60% of our dataset for analysis. \n",
    "\n",
    "Isn't there a faster way of eliminating the other 40% of records without having to check the date value of each row? \n",
    "\n",
    "Yes! With date-partitioned tables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reducing data scanned with Partitioned tables\n",
    "\n",
    "Partitioning automatically buckets groups records of data based on a date or timestamp value which enables fast filtering by date.\n",
    "\n",
    "We've already created a new table called `qwiklabs-resources.tpcds_2t_flat_part_clust.partitioned_table` which we will show you how to do in the next section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Executing query with job ID: 60bc2893-6e6d-4a0a-98d9-f076023530d9\n",
      "Query executing: 20.22s\n",
      "Query complete after 22.18s\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>timestamp</th>\n",
       "      <th>ss_item_sk</th>\n",
       "      <th>i_product_name</th>\n",
       "      <th>ss_customer_sk</th>\n",
       "      <th>c_first_name</th>\n",
       "      <th>c_last_name</th>\n",
       "      <th>c_email_address</th>\n",
       "      <th>c_preferred_cust_flag</th>\n",
       "      <th>ss_quantity</th>\n",
       "      <th>ss_net_paid</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>2002-12-07 19:59:26+00:00</td>\n",
       "      <td>43596</td>\n",
       "      <td>callyn stantipriese</td>\n",
       "      <td>437938</td>\n",
       "      <td>Emily</td>\n",
       "      <td>Miles</td>\n",
       "      <td>Emily.Miles@mJbptc1eoN.org</td>\n",
       "      <td>N</td>\n",
       "      <td>100</td>\n",
       "      <td>19972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2001-01-24 03:01:01+00:00</td>\n",
       "      <td>17457</td>\n",
       "      <td>ationantieseationought</td>\n",
       "      <td>320021</td>\n",
       "      <td>Carlos</td>\n",
       "      <td>Knight</td>\n",
       "      <td>Carlos.Knight@KCOyPHE1rH.com</td>\n",
       "      <td>N</td>\n",
       "      <td>100</td>\n",
       "      <td>19972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>2001-07-21 15:35:26+00:00</td>\n",
       "      <td>36229</td>\n",
       "      <td>n stableablecallypri</td>\n",
       "      <td>5010096</td>\n",
       "      <td>Amy</td>\n",
       "      <td>Lugo</td>\n",
       "      <td>Amy.Lugo@teJtjEufGCvXiRuph.com</td>\n",
       "      <td>Y</td>\n",
       "      <td>100</td>\n",
       "      <td>19972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>2002-01-05 01:32:17+00:00</td>\n",
       "      <td>2748</td>\n",
       "      <td>eingeseationable</td>\n",
       "      <td>4719256</td>\n",
       "      <td>Andrew</td>\n",
       "      <td>Henderson</td>\n",
       "      <td>Andrew.Henderson@O9Bv4tNyUsDHAt7pyb.com</td>\n",
       "      <td>N</td>\n",
       "      <td>100</td>\n",
       "      <td>19972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>2000-07-14 02:05:37+00:00</td>\n",
       "      <td>26269</td>\n",
       "      <td>n stcallyablecallyable</td>\n",
       "      <td>6127654</td>\n",
       "      <td>Ronald</td>\n",
       "      <td>Jung</td>\n",
       "      <td>Ronald.Jung@Maoo95C.edu</td>\n",
       "      <td>N</td>\n",
       "      <td>100</td>\n",
       "      <td>19972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>2002-11-19 03:22:49+00:00</td>\n",
       "      <td>17574</td>\n",
       "      <td>eseationantiationought</td>\n",
       "      <td>7200746</td>\n",
       "      <td>Justin</td>\n",
       "      <td>Black</td>\n",
       "      <td>Justin.Black@iiTb4pzrv2QDa.org</td>\n",
       "      <td>Y</td>\n",
       "      <td>100</td>\n",
       "      <td>19964</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>2002-09-13 00:01:11+00:00</td>\n",
       "      <td>15363</td>\n",
       "      <td>pricallypriantiought</td>\n",
       "      <td>2244656</td>\n",
       "      <td>Victor</td>\n",
       "      <td>Baker</td>\n",
       "      <td>Victor.Baker@tVs0Nva.edu</td>\n",
       "      <td>Y</td>\n",
       "      <td>100</td>\n",
       "      <td>19964</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>2000-12-31 16:26:13+00:00</td>\n",
       "      <td>40169</td>\n",
       "      <td>n stcallyoughtbarese</td>\n",
       "      <td>5965669</td>\n",
       "      <td>Shane</td>\n",
       "      <td>Martinez</td>\n",
       "      <td>Shane.Martinez@rTIQV8rHUiBj.com</td>\n",
       "      <td>N</td>\n",
       "      <td>100</td>\n",
       "      <td>19928</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>2001-04-06 17:44:42+00:00</td>\n",
       "      <td>21245</td>\n",
       "      <td>antieseableoughtable</td>\n",
       "      <td>1514229</td>\n",
       "      <td>Daniel</td>\n",
       "      <td>Cannon</td>\n",
       "      <td>Daniel.Cannon@Xe.org</td>\n",
       "      <td>Y</td>\n",
       "      <td>100</td>\n",
       "      <td>19928</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>2000-09-26 16:34:52+00:00</td>\n",
       "      <td>45863</td>\n",
       "      <td>pricallyeingantiese</td>\n",
       "      <td>2433894</td>\n",
       "      <td>Joyce</td>\n",
       "      <td>Robles</td>\n",
       "      <td>Joyce.Robles@jjiqRB7uJl7.org</td>\n",
       "      <td>Y</td>\n",
       "      <td>100</td>\n",
       "      <td>19928</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  timestamp  ss_item_sk          i_product_name  \\\n",
       "0 2002-12-07 19:59:26+00:00       43596     callyn stantipriese   \n",
       "1 2001-01-24 03:01:01+00:00       17457  ationantieseationought   \n",
       "2 2001-07-21 15:35:26+00:00       36229    n stableablecallypri   \n",
       "3 2002-01-05 01:32:17+00:00        2748        eingeseationable   \n",
       "4 2000-07-14 02:05:37+00:00       26269  n stcallyablecallyable   \n",
       "5 2002-11-19 03:22:49+00:00       17574  eseationantiationought   \n",
       "6 2002-09-13 00:01:11+00:00       15363    pricallypriantiought   \n",
       "7 2000-12-31 16:26:13+00:00       40169    n stcallyoughtbarese   \n",
       "8 2001-04-06 17:44:42+00:00       21245    antieseableoughtable   \n",
       "9 2000-09-26 16:34:52+00:00       45863     pricallyeingantiese   \n",
       "\n",
       "   ss_customer_sk c_first_name c_last_name  \\\n",
       "0          437938        Emily       Miles   \n",
       "1          320021       Carlos      Knight   \n",
       "2         5010096          Amy        Lugo   \n",
       "3         4719256       Andrew   Henderson   \n",
       "4         6127654       Ronald        Jung   \n",
       "5         7200746       Justin       Black   \n",
       "6         2244656       Victor       Baker   \n",
       "7         5965669        Shane    Martinez   \n",
       "8         1514229       Daniel      Cannon   \n",
       "9         2433894        Joyce      Robles   \n",
       "\n",
       "                           c_email_address c_preferred_cust_flag  ss_quantity  \\\n",
       "0               Emily.Miles@mJbptc1eoN.org                     N          100   \n",
       "1             Carlos.Knight@KCOyPHE1rH.com                     N          100   \n",
       "2           Amy.Lugo@teJtjEufGCvXiRuph.com                     Y          100   \n",
       "3  Andrew.Henderson@O9Bv4tNyUsDHAt7pyb.com                     N          100   \n",
       "4                  Ronald.Jung@Maoo95C.edu                     N          100   \n",
       "5           Justin.Black@iiTb4pzrv2QDa.org                     Y          100   \n",
       "6                 Victor.Baker@tVs0Nva.edu                     Y          100   \n",
       "7          Shane.Martinez@rTIQV8rHUiBj.com                     N          100   \n",
       "8                     Daniel.Cannon@Xe.org                     Y          100   \n",
       "9             Joyce.Robles@jjiqRB7uJl7.org                     Y          100   \n",
       "\n",
       "  ss_net_paid  \n",
       "0       19972  \n",
       "1       19972  \n",
       "2       19972  \n",
       "3       19972  \n",
       "4       19972  \n",
       "5       19964  \n",
       "6       19964  \n",
       "7       19928  \n",
       "8       19928  \n",
       "9       19928  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --verbose\n",
    "SELECT  \n",
    "  timestamp,\n",
    "  s.ss_item_sk,\n",
    "  i.i_product_name,\n",
    "  s.ss_customer_sk,\n",
    "  c.c_first_name,\n",
    "  c.c_last_name,\n",
    "  c.c_email_address,\n",
    "  c.c_preferred_cust_flag,\n",
    "  s.ss_quantity,\n",
    "  s.ss_net_paid\n",
    "FROM\n",
    "  `qwiklabs-resources.tpcds_2t_flat_part_clust.partitioned_table` AS s\n",
    "    \n",
    "  /*  Date and time tables denormalized as part of the partitioned table\n",
    "  JOIN \n",
    "  `qwiklabs-resources.tpcds_2t_baseline.date_dim` AS d\n",
    "  ON s.ss_sold_date_sk = d.d_date_sk\n",
    "  \n",
    "  JOIN\n",
    "  `qwiklabs-resources.tpcds_2t_baseline.time_dim` AS t\n",
    "  ON s.ss_sold_time_sk = t.t_time_sk\n",
    "  */  \n",
    "  \n",
    "  JOIN\n",
    "  `qwiklabs-resources.tpcds_2t_baseline.item` AS i\n",
    "  ON s.ss_item_sk = i.i_item_sk\n",
    "  \n",
    "  JOIN \n",
    "  `qwiklabs-resources.tpcds_2t_baseline.customer` AS c\n",
    "  ON s.ss_customer_sk = c. c_customer_sk\n",
    "  \n",
    "WHERE DATE(timestamp) >= '2000-01-01' AND ss_customer_sk IS NOT NULL\n",
    "ORDER BY ss_net_paid DESC\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Performance comparison\n",
    "\n",
    "|                 \t| Original \t| Partitioned \t| Improvement        \t|   \t|\n",
    "|-----------------\t|----------\t|-------------\t|--------------------\t|---\t|\n",
    "| Query time      \t| 27s      \t| 24.2s       \t| 10% faster         \t|   \t|\n",
    "| Bytes processed \t| 290 GB   \t| 144 GB      \t| 50% cheaper        \t|   \t|\n",
    "| Slot time       \t| 10 hr    \t| 7 hr        \t| 30% more efficient \t|   \t|\n",
    "| Bytes Shuffled  \t| 382 GB   \t| 293 GB      \t| 23% more efficient \t|   \t|\n",
    "\n",
    "![img/before-partition.png](img/before-partition.png)\n",
    "\n",
    "![img/after-partition.png](img/after-partition.png)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating Partitioned Tables\n",
    "\n",
    "You can create partitioned tables in a number of ways:\n",
    "\n",
    "1. Using [SQL DDL](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement) From the results of a query\n",
    "2. When you create a new table schema\n",
    "\n",
    "For a full list check out the [documentation](https://cloud.google.com/bigquery/docs/creating-column-partitions). \n",
    "\n",
    "Note that date-partitioned tables (dedicated user-specified date/time column) are different than [ingestion-time partitioned tables](https://cloud.google.com/bigquery/docs/creating-partitioned-tables) which are also available but not covered here. \n",
    "\n",
    "Below is the example query to create the partitioned table we used earlier (no need to execute it). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```sql\n",
    "CREATE OR REPLACE TABLE IF NOT EXISTS `qwiklabs-resources.tpcds_2t_flat_part_clust.partitioned_table`\n",
    "PARTITION BY DATE(timestamp) -- You define the column to partition on (it must be a date or time)\n",
    "CLUSTER BY ss_net_paid -- Clustering is an added benefit for partitioned tables and explained later\n",
    "OPTIONS (\n",
    "  require_partition_filter=true -- You can mandate users must provide a WHERE clause when querying\n",
    "  )\n",
    "AS\n",
    "\n",
    "SELECT  \n",
    "  PARSE_TIMESTAMP (\n",
    "  \n",
    "  \"%Y-%m-%d %T %p\",\n",
    "  \n",
    "  CONCAT(\n",
    "    CAST(d_date AS STRING),\n",
    "    ' ',\n",
    "    CAST(t_hour AS STRING),\n",
    "    ':',\n",
    "    CAST(t_minute AS STRING),\n",
    "    ':',\n",
    "    CAST(t_second AS STRING),\n",
    "    ' ',t_am_pm)\n",
    "    \n",
    "    ,\"America/Los_Angeles\")\n",
    "    \n",
    "    AS timestamp,\n",
    "  s.*\n",
    "FROM\n",
    "  `qwiklabs-resources.tpcds_2t_flat_part_clust.store_sales` AS s\n",
    "  JOIN \n",
    "  `qwiklabs-resources.tpcds_2t_flat_part_clust.date_dim` AS d\n",
    "  ON s.ss_sold_date_sk = d.d_date_sk\n",
    "  \n",
    "  JOIN\n",
    "  `qwiklabs-resources.tpcds_2t_flat_part_clust.time_dim` AS t\n",
    "  ON s.ss_sold_time_sk = t.t_time_sk\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Converting the TCP-DS tables to Partitioned tables\n",
    "\n",
    "We can quickly verify that none of the tables in our baseline schema have partitioned columns by using the metadata table `INFORMATION_SCHEMA`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>table_catalog</th>\n",
       "      <th>table_schema</th>\n",
       "      <th>table_name</th>\n",
       "      <th>column_name</th>\n",
       "      <th>ordinal_position</th>\n",
       "      <th>is_nullable</th>\n",
       "      <th>data_type</th>\n",
       "      <th>is_generated</th>\n",
       "      <th>generation_expression</th>\n",
       "      <th>is_stored</th>\n",
       "      <th>is_hidden</th>\n",
       "      <th>is_updatable</th>\n",
       "      <th>is_system_defined</th>\n",
       "      <th>is_partitioning_column</th>\n",
       "      <th>clustering_ordinal_position</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [table_catalog, table_schema, table_name, column_name, ordinal_position, is_nullable, data_type, is_generated, generation_expression, is_stored, is_hidden, is_updatable, is_system_defined, is_partitioning_column, clustering_ordinal_position]\n",
       "Index: []"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT * FROM \n",
    " `qwiklabs-resources.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`\n",
    "WHERE \n",
    "  is_partitioning_column = 'YES' OR clustering_ordinal_position IS NOT NULL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a new dataset to hold our partitioned tables\n",
    "\n",
    "Let's leave the existing baseline dataset and tables and create a new dataset titled `tpcds_2t_flat_part_clust` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BigQuery dataset already exists, let's not recreate it.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "## Create a BigQuery dataset for tpcds_2t_flat_part_clust if it doesn't exist\n",
    "datasetexists=$(bq ls -d | grep -w tpcds_2t_flat_part_clust)\n",
    "\n",
    "if [ -n \"$datasetexists\" ]; then\n",
    "    echo -e \"BigQuery dataset already exists, let's not recreate it.\"\n",
    "\n",
    "else\n",
    "    echo \"Creating BigQuery dataset titled: tpcds_2t_flat_part_clust\"\n",
    "    \n",
    "    bq --location=US mk --dataset \\\n",
    "        --description 'Partitioned and Clustered' \\\n",
    "        $PROJECT:tpcds_2t_flat_part_clust\n",
    "   echo \"\\nHere are your current datasets:\"\n",
    "   bq ls\n",
    "fi   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a new empty partitioned table \n",
    "\n",
    "Let's pick one table to add partitioning and clustering to. It's easiest to add a partitioning column to a data table that has existing date or timestamp columns. Here we will use the `store` table which is a dimensional table for the name and address of each storefront for our business. \n",
    "\n",
    "#### What to cluster on?\n",
    "\n",
    "Finding a column to partition on is often the easy part. BigQuery also supports [clustering](https://cloud.google.com/bigquery/docs/clustered-tables) on partitioned tables which can provide performance improvements for commonly filtered or sorted queries. The column(s) you specify are used to colocate related data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>s_store_sk</th>\n",
       "      <th>s_store_id</th>\n",
       "      <th>s_rec_start_date</th>\n",
       "      <th>s_rec_end_date</th>\n",
       "      <th>s_closed_date_sk</th>\n",
       "      <th>s_store_name</th>\n",
       "      <th>s_number_employees</th>\n",
       "      <th>s_floor_space</th>\n",
       "      <th>s_hours</th>\n",
       "      <th>s_manager</th>\n",
       "      <th>...</th>\n",
       "      <th>s_street_name</th>\n",
       "      <th>s_street_type</th>\n",
       "      <th>s_suite_number</th>\n",
       "      <th>s_city</th>\n",
       "      <th>s_county</th>\n",
       "      <th>s_state</th>\n",
       "      <th>s_zip</th>\n",
       "      <th>s_country</th>\n",
       "      <th>s_gmt_offset</th>\n",
       "      <th>s_tax_precentage</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>0 rows × 29 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [s_store_sk, s_store_id, s_rec_start_date, s_rec_end_date, s_closed_date_sk, s_store_name, s_number_employees, s_floor_space, s_hours, s_manager, s_market_id, s_geography_class, s_market_desc, s_market_manager, s_division_id, s_division_name, s_company_id, s_company_name, s_street_number, s_street_name, s_street_type, s_suite_number, s_city, s_county, s_state, s_zip, s_country, s_gmt_offset, s_tax_precentage]\n",
       "Index: []\n",
       "\n",
       "[0 rows x 29 columns]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE TABLE tpcds_2t_flat_part_clust.store(\n",
    "    s_store_sk                int64               NOT NULL,\n",
    "    s_store_id                string              NOT NULL,\n",
    "    s_rec_start_date          date                          ,\n",
    "    s_rec_end_date            date                          ,\n",
    "    s_closed_date_sk          int64                       ,\n",
    "    s_store_name              string                   ,\n",
    "    s_number_employees        int64                       ,\n",
    "    s_floor_space             int64                       ,\n",
    "    s_hours                   string                      ,\n",
    "    s_manager                 string                   ,\n",
    "    s_market_id               int64                       ,\n",
    "    s_geography_class         string                  ,\n",
    "    s_market_desc             string                  ,\n",
    "    s_market_manager          string                   ,\n",
    "    s_division_id             int64                       ,\n",
    "    s_division_name           string                   ,\n",
    "    s_company_id              int64                       ,\n",
    "    s_company_name            string                   ,\n",
    "    s_street_number           string                   ,\n",
    "    s_street_name             string                   ,\n",
    "    s_street_type             string                      ,\n",
    "    s_suite_number            string                      ,\n",
    "    s_city                    string                   ,\n",
    "    s_county                  string                   ,\n",
    "    s_state                   string                       ,\n",
    "    s_zip                     string                      ,\n",
    "    s_country                 string                   ,\n",
    "    s_gmt_offset              numeric                  ,\n",
    "    s_tax_precentage          numeric                  )\n",
    "\n",
    "    # TODO: Specify a date field to partition on and a field to cluster on:\n",
    "\tPARTITION BY s_rec_start_date\n",
    "\tCLUSTER BY s_zip;\n",
    "    \n",
    "    SELECT * FROM tpcds_2t_flat_part_clust.store LIMIT 0;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that you have the empty table, it's time to populate it with data. This can take a while, feel free to cancel the execution and continue with the lab. We'll use the BigQuery Data Transfer Service later to copy over the entire dataset in seconds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>s_store_sk</th>\n",
       "      <th>s_store_id</th>\n",
       "      <th>s_rec_start_date</th>\n",
       "      <th>s_rec_end_date</th>\n",
       "      <th>s_closed_date_sk</th>\n",
       "      <th>s_store_name</th>\n",
       "      <th>s_number_employees</th>\n",
       "      <th>s_floor_space</th>\n",
       "      <th>s_hours</th>\n",
       "      <th>s_manager</th>\n",
       "      <th>...</th>\n",
       "      <th>s_street_name</th>\n",
       "      <th>s_street_type</th>\n",
       "      <th>s_suite_number</th>\n",
       "      <th>s_city</th>\n",
       "      <th>s_county</th>\n",
       "      <th>s_state</th>\n",
       "      <th>s_zip</th>\n",
       "      <th>s_country</th>\n",
       "      <th>s_gmt_offset</th>\n",
       "      <th>s_tax_precentage</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>26</td>\n",
       "      <td>AAAAAAAAKBAAAAAA</td>\n",
       "      <td>None</td>\n",
       "      <td>2000-03-12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>cally</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8AM-4PM</td>\n",
       "      <td>None</td>\n",
       "      <td>...</td>\n",
       "      <td>None</td>\n",
       "      <td>Avenue</td>\n",
       "      <td>Suite 50</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>GA</td>\n",
       "      <td>None</td>\n",
       "      <td>United States</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>96</td>\n",
       "      <td>AAAAAAAAOFAAAAAA</td>\n",
       "      <td>2001-03-13</td>\n",
       "      <td>None</td>\n",
       "      <td>2450893.0</td>\n",
       "      <td>cally</td>\n",
       "      <td>288.0</td>\n",
       "      <td>5453633.0</td>\n",
       "      <td>8AM-4PM</td>\n",
       "      <td>Denis Shah</td>\n",
       "      <td>...</td>\n",
       "      <td>View</td>\n",
       "      <td>Wy</td>\n",
       "      <td>Suite 340</td>\n",
       "      <td>Shiloh</td>\n",
       "      <td>Franklin Parish</td>\n",
       "      <td>LA</td>\n",
       "      <td>79275</td>\n",
       "      <td>United States</td>\n",
       "      <td>-6</td>\n",
       "      <td>0.11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>175</td>\n",
       "      <td>AAAAAAAAPKAAAAAA</td>\n",
       "      <td>1997-03-13</td>\n",
       "      <td>None</td>\n",
       "      <td>NaN</td>\n",
       "      <td>anti</td>\n",
       "      <td>212.0</td>\n",
       "      <td>8062566.0</td>\n",
       "      <td>8AM-4PM</td>\n",
       "      <td>Paul Vela</td>\n",
       "      <td>...</td>\n",
       "      <td>Third Railroad</td>\n",
       "      <td>Ln</td>\n",
       "      <td>Suite D</td>\n",
       "      <td>Shady Grove</td>\n",
       "      <td>Luce County</td>\n",
       "      <td>MI</td>\n",
       "      <td>42812</td>\n",
       "      <td>United States</td>\n",
       "      <td>-5</td>\n",
       "      <td>0.07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>157</td>\n",
       "      <td>AAAAAAAANJAAAAAA</td>\n",
       "      <td>1997-03-13</td>\n",
       "      <td>None</td>\n",
       "      <td>2451023.0</td>\n",
       "      <td>ation</td>\n",
       "      <td>262.0</td>\n",
       "      <td>7226618.0</td>\n",
       "      <td>8AM-4PM</td>\n",
       "      <td>James Washington</td>\n",
       "      <td>...</td>\n",
       "      <td>3rd</td>\n",
       "      <td>Drive</td>\n",
       "      <td>Suite 20</td>\n",
       "      <td>Midway</td>\n",
       "      <td>Luce County</td>\n",
       "      <td>MI</td>\n",
       "      <td>41904</td>\n",
       "      <td>United States</td>\n",
       "      <td>-5</td>\n",
       "      <td>0.05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>67</td>\n",
       "      <td>AAAAAAAADEAAAAAA</td>\n",
       "      <td>1997-03-13</td>\n",
       "      <td>None</td>\n",
       "      <td>2451034.0</td>\n",
       "      <td>ation</td>\n",
       "      <td>272.0</td>\n",
       "      <td>6504783.0</td>\n",
       "      <td>8AM-4PM</td>\n",
       "      <td>Jim Meyer</td>\n",
       "      <td>...</td>\n",
       "      <td>12th Center</td>\n",
       "      <td>Avenue</td>\n",
       "      <td>Suite J</td>\n",
       "      <td>Shiloh</td>\n",
       "      <td>Luce County</td>\n",
       "      <td>MI</td>\n",
       "      <td>49275</td>\n",
       "      <td>United States</td>\n",
       "      <td>-5</td>\n",
       "      <td>0.02</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 29 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   s_store_sk        s_store_id s_rec_start_date s_rec_end_date  \\\n",
       "0          26  AAAAAAAAKBAAAAAA             None     2000-03-12   \n",
       "1          96  AAAAAAAAOFAAAAAA       2001-03-13           None   \n",
       "2         175  AAAAAAAAPKAAAAAA       1997-03-13           None   \n",
       "3         157  AAAAAAAANJAAAAAA       1997-03-13           None   \n",
       "4          67  AAAAAAAADEAAAAAA       1997-03-13           None   \n",
       "\n",
       "   s_closed_date_sk s_store_name  s_number_employees  s_floor_space  s_hours  \\\n",
       "0               NaN        cally                 NaN            NaN  8AM-4PM   \n",
       "1         2450893.0        cally               288.0      5453633.0  8AM-4PM   \n",
       "2               NaN         anti               212.0      8062566.0  8AM-4PM   \n",
       "3         2451023.0        ation               262.0      7226618.0  8AM-4PM   \n",
       "4         2451034.0        ation               272.0      6504783.0  8AM-4PM   \n",
       "\n",
       "          s_manager  ...   s_street_name s_street_type s_suite_number  \\\n",
       "0              None  ...            None        Avenue       Suite 50   \n",
       "1        Denis Shah  ...           View             Wy      Suite 340   \n",
       "2         Paul Vela  ...  Third Railroad            Ln        Suite D   \n",
       "3  James Washington  ...            3rd          Drive       Suite 20   \n",
       "4         Jim Meyer  ...     12th Center        Avenue        Suite J   \n",
       "\n",
       "        s_city         s_county s_state  s_zip      s_country s_gmt_offset  \\\n",
       "0         None             None      GA   None  United States         None   \n",
       "1       Shiloh  Franklin Parish      LA  79275  United States           -6   \n",
       "2  Shady Grove      Luce County      MI  42812  United States           -5   \n",
       "3       Midway      Luce County      MI  41904  United States           -5   \n",
       "4       Shiloh      Luce County      MI  49275  United States           -5   \n",
       "\n",
       "  s_tax_precentage  \n",
       "0             None  \n",
       "1             0.11  \n",
       "2             0.07  \n",
       "3             0.05  \n",
       "4             0.02  \n",
       "\n",
       "[5 rows x 29 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "insert into tpcds_2t_flat_part_clust.store(s_store_sk, s_store_id, s_rec_start_date, s_rec_end_date, s_closed_date_sk,\n",
    "    s_store_name, s_number_employees, s_floor_space, s_hours, s_manager, s_market_id, s_geography_class, s_market_desc,\n",
    "    s_market_manager, s_division_id, s_division_name, s_company_id, s_company_name, s_street_number, s_street_name,\n",
    "    s_street_type, s_suite_number, s_city, s_county, s_state, s_zip, s_country, s_gmt_offset, s_tax_precentage)\n",
    "select s_store_sk, s_store_id, s_rec_start_date, s_rec_end_date, s_closed_date_sk,\n",
    "    s_store_name, s_number_employees, s_floor_space, s_hours, s_manager, s_market_id, s_geography_class, s_market_desc,\n",
    "    s_market_manager, s_division_id, s_division_name, s_company_id, s_company_name, s_street_number, s_street_name,\n",
    "    s_street_type, s_suite_number, s_city, s_county, s_state, s_zip, s_country, s_gmt_offset, s_tax_precentage\n",
    "from `qwiklabs-resources.tpcds_2t_baseline.store`;\n",
    "\n",
    "SELECT * FROM tpcds_2t_flat_part_clust.store LIMIT 5;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare performance\n",
    "\n",
    "In the console UI, copy and paste the below queries and run them as one statement.\n",
    "\n",
    "```sql\n",
    "SELECT * \n",
    "FROM `qwiklabs-resources.tpcds_2t_baseline.store`  \n",
    "WHERE s_rec_start_date > '2010-01-01';\n",
    "\n",
    "\n",
    "SELECT * \n",
    "FROM `qwiklabs-resources.tpcds_2t_flat_part_clust.store`\n",
    "WHERE s_rec_start_date > '2010-01-01';\n",
    "```\n",
    "\n",
    "The results should look like the below"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img/partition-scan.png](img/partition-scan.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why did the first query do an entire table scan of 59 KB (small table) while the second query barely processed any data 117 bytes = .11 KB?\n",
    "\n",
    "It's because the second query hits a partitioned column and it automatically knows that there is no partition for 2010 data (the dataset goes up to 2003). Unlike the first query, it does this without having to open individual records. \n",
    "\n",
    "Although this table is quite small, the same benefit applies to any partitioned table no matter how large."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running the 99 benchmark queries on partitioned tables\n",
    "\n",
    "Below are the `CREATE TABLE` statements for the remaining tables in our new `tpcds_2t_flat_part_clust` dataset. Note how you can still use clustering on a table that does not have an existing field to partition on by simply adding an `empty_date` column of type `date`.\n",
    "\n",
    "Run the below statement to finishing creating the remaining 25 tables for our new partitioned dataset tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery\n",
    "create table tpcds_2t_flat_part_clust.customer_address(\n",
    "    ca_address_sk             int64               NOT NULL,\n",
    "    ca_address_id             string              NOT NULL,\n",
    "    ca_street_number          string                      ,\n",
    "    ca_street_name            string                      ,\n",
    "    ca_street_type            string                      ,\n",
    "    ca_suite_number           string                      ,\n",
    "    ca_city                   string                      ,\n",
    "    ca_county                 string                      ,\n",
    "    ca_state                  string                      ,\n",
    "    ca_zip                    string                      ,\n",
    "    ca_country                string                      ,\n",
    "    ca_gmt_offset             numeric                     ,\n",
    "    ca_location_type          string \t\t\t\t\t  ,\n",
    "\tempty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY ca_address_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.customer_demographics(\n",
    "    cd_demo_sk                int64               NOT NULL,\n",
    "    cd_gender                 string                       ,\n",
    "    cd_marital_status         string                       ,\n",
    "    cd_education_status       string                      ,\n",
    "    cd_purchase_estimate      int64                       ,\n",
    "    cd_credit_rating          string                      ,\n",
    "    cd_dep_count              int64                       ,\n",
    "    cd_dep_employed_count     int64                       ,\n",
    "    cd_dep_college_count      int64                       ,\n",
    "    empty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY cd_demo_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.date_dim(\n",
    "    d_date_sk                 int64               NOT NULL,\n",
    "    d_date_id                 string              NOT NULL,\n",
    "    d_date                    date                          ,\n",
    "    d_month_seq               int64                       ,\n",
    "    d_week_seq                int64                       ,\n",
    "    d_quarter_seq             int64                       ,\n",
    "    d_year                    int64                       ,\n",
    "    d_dow                     int64                       ,\n",
    "    d_moy                     int64                       ,\n",
    "    d_dom                     int64                       ,\n",
    "    d_qoy                     int64                       ,\n",
    "    d_fy_year                 int64                       ,\n",
    "    d_fy_quarter_seq          int64                       ,\n",
    "    d_fy_week_seq             int64                       ,\n",
    "    d_day_name                string                       ,\n",
    "    d_quarter_name            string                       ,\n",
    "    d_holiday                 string                       ,\n",
    "    d_weekend                 string                       ,\n",
    "    d_following_holiday       string                       ,\n",
    "    d_first_dom               int64                       ,\n",
    "    d_last_dom                int64                       ,\n",
    "    d_same_day_ly             int64                       ,\n",
    "    d_same_day_lq             int64                       ,\n",
    "    d_current_day             string                       ,\n",
    "    d_current_week            string                       ,\n",
    "    d_current_month           string                       ,\n",
    "    d_current_quarter         string                       ,\n",
    "    d_current_year            string                       ,\n",
    "\tempty_date                date                         )\n",
    "\tPARTITION BY empty_date                                   \n",
    "\tCLUSTER BY d_date_sk;                        \n",
    "\n",
    "create table tpcds_2t_flat_part_clust.warehouse(\n",
    "    w_warehouse_sk            int64               NOT NULL,\n",
    "    w_warehouse_id            string              NOT NULL,\n",
    "    w_warehouse_name          string                   ,\n",
    "    w_warehouse_sq_ft         int64                       ,\n",
    "    w_street_number           string                      ,\n",
    "    w_street_name             string                   ,\n",
    "    w_street_type             string                      ,\n",
    "    w_suite_number            string                      ,\n",
    "    w_city                    string                   ,\n",
    "    w_county                  string                   ,\n",
    "    w_state                   string                       ,\n",
    "    w_zip                     string                      ,\n",
    "    w_country                 string                     ,\n",
    "    w_gmt_offset              numeric                    ,\n",
    "    empty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY w_warehouse_sk; \n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.ship_mode(\n",
    "    sm_ship_mode_sk           int64               NOT NULL,\n",
    "    sm_ship_mode_id           string              NOT NULL,\n",
    "    sm_type                   string                      ,\n",
    "    sm_code                   string                      ,\n",
    "    sm_carrier                string                      ,\n",
    "    sm_contract               string                      ,\n",
    "    empty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY sm_carrier; \n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.time_dim(\n",
    "    t_time_sk                 int64               NOT NULL,\n",
    "    t_time_id                 string              NOT NULL,\n",
    "    t_time                    int64                       ,\n",
    "    t_hour                    int64                       ,\n",
    "    t_minute                  int64                       ,\n",
    "    t_second                  int64                       ,\n",
    "    t_am_pm                   string                       ,\n",
    "    t_shift                   string                      ,\n",
    "    t_sub_shift               string                      ,\n",
    "    t_meal_time               string                      ,\n",
    "\tempty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY t_time;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.reason(\n",
    "    r_reason_sk               int64               NOT NULL,\n",
    "    r_reason_id               string              NOT NULL,\n",
    "    r_reason_desc             string                     ,\n",
    "\tempty_date               date                       )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY r_reason_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.income_band(\n",
    "    ib_income_band_sk         int64               NOT NULL,\n",
    "    ib_lower_bound            int64                       ,\n",
    "    ib_upper_bound            int64                       ,\n",
    "\tempty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY ib_lower_bound;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.item(\n",
    "    i_item_sk                 int64               NOT NULL,\n",
    "    i_item_id                 string              NOT NULL,\n",
    "    i_rec_start_date          date                          ,\n",
    "    i_rec_end_date            date                          ,\n",
    "    i_item_desc               string                  ,\n",
    "    i_current_price           numeric                  ,\n",
    "    i_wholesale_cost          numeric                  ,\n",
    "    i_brand_id                int64                       ,\n",
    "    i_brand                   string                      ,\n",
    "    i_class_id                int64                       ,\n",
    "    i_class                   string                      ,\n",
    "    i_category_id             int64                       ,\n",
    "    i_category                string                      ,\n",
    "    i_manufact_id             int64                       ,\n",
    "    i_manufact                string                      ,\n",
    "    i_size                    string                      ,\n",
    "    i_formulation             string                      ,\n",
    "    i_color                   string                      ,\n",
    "    i_units                   string                      ,\n",
    "    i_container               string                      ,\n",
    "    i_manager_id              int64                       ,\n",
    "    i_product_name            string                      )\n",
    "\tPARTITION BY i_rec_start_date\n",
    "\tCLUSTER BY i_category;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.store(\n",
    "    s_store_sk                int64               NOT NULL,\n",
    "    s_store_id                string              NOT NULL,\n",
    "    s_rec_start_date          date                          ,\n",
    "    s_rec_end_date            date                          ,\n",
    "    s_closed_date_sk          int64                       ,\n",
    "    s_store_name              string                   ,\n",
    "    s_number_employees        int64                       ,\n",
    "    s_floor_space             int64                       ,\n",
    "    s_hours                   string                      ,\n",
    "    s_manager                 string                   ,\n",
    "    s_market_id               int64                       ,\n",
    "    s_geography_class         string                  ,\n",
    "    s_market_desc             string                  ,\n",
    "    s_market_manager          string                   ,\n",
    "    s_division_id             int64                       ,\n",
    "    s_division_name           string                   ,\n",
    "    s_company_id              int64                       ,\n",
    "    s_company_name            string                   ,\n",
    "    s_street_number           string                   ,\n",
    "    s_street_name             string                   ,\n",
    "    s_street_type             string                      ,\n",
    "    s_suite_number            string                      ,\n",
    "    s_city                    string                   ,\n",
    "    s_county                  string                   ,\n",
    "    s_state                   string                       ,\n",
    "    s_zip                     string                      ,\n",
    "    s_country                 string                   ,\n",
    "    s_gmt_offset              numeric                  ,\n",
    "    s_tax_precentage          numeric                  )\n",
    "\tPARTITION BY s_rec_start_date\n",
    "\tCLUSTER BY s_zip;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.call_center(\n",
    "    cc_call_center_sk         int64               NOT NULL,\n",
    "    cc_call_center_id         string              NOT NULL,\n",
    "    cc_rec_start_date         date                          ,\n",
    "    cc_rec_end_date           date                          ,\n",
    "    cc_closed_date_sk         int64                       ,\n",
    "    cc_open_date_sk           int64                       ,\n",
    "    cc_name                   string                   ,\n",
    "    cc_class                  string                   ,\n",
    "    cc_employees              int64                       ,\n",
    "    cc_sq_ft                  int64                       ,\n",
    "    cc_hours                  string                      ,\n",
    "    cc_manager                string                   ,\n",
    "    cc_mkt_id                 int64                       ,\n",
    "    cc_mkt_class              string                     ,\n",
    "    cc_mkt_desc               string                  ,\n",
    "    cc_market_manager         string                   ,\n",
    "    cc_division               int64                       ,\n",
    "    cc_division_name          string                   ,\n",
    "    cc_company                int64                       ,\n",
    "    cc_company_name           string                      ,\n",
    "    cc_street_number          string                      ,\n",
    "    cc_street_name            string                   ,\n",
    "    cc_street_type            string                      ,\n",
    "    cc_suite_number           string                      ,\n",
    "    cc_city                   string                   ,\n",
    "    cc_county                 string                   ,\n",
    "    cc_state                  string                       ,\n",
    "    cc_zip                    string                      ,\n",
    "    cc_country                string                   ,\n",
    "    cc_gmt_offset             numeric                  ,\n",
    "    cc_tax_percentage         numeric                  )\n",
    "\tPARTITION BY cc_rec_start_date\n",
    "\tCLUSTER BY cc_county;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.customer(\n",
    "    c_customer_sk             int64               NOT NULL,\n",
    "    c_customer_id             string              NOT NULL,\n",
    "    c_current_cdemo_sk        int64                       ,\n",
    "    c_current_hdemo_sk        int64                       ,\n",
    "    c_current_addr_sk         int64                       ,\n",
    "    c_first_shipto_date_sk    int64                       ,\n",
    "    c_first_sales_date_sk     int64                       ,\n",
    "    c_salutation              string                      ,\n",
    "    c_first_name              string                      ,\n",
    "    c_last_name               string                      ,\n",
    "    c_preferred_cust_flag     string                       ,\n",
    "    c_birth_day               int64                       ,\n",
    "    c_birth_month             int64                       ,\n",
    "    c_birth_year              int64                       ,\n",
    "    c_birth_country           string                      ,\n",
    "    c_login                   string                      ,\n",
    "    c_email_address           string                      ,\n",
    "    c_last_review_date_sk     int64                       ,\n",
    "\tempty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY c_customer_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.web_site(\n",
    "    web_site_sk               int64               NOT NULL,\n",
    "    web_site_id               string              NOT NULL,\n",
    "    web_rec_start_date        date                          ,\n",
    "    web_rec_end_date          date                          ,\n",
    "    web_name                  string                   ,\n",
    "    web_open_date_sk          int64                       ,\n",
    "    web_close_date_sk         int64                       ,\n",
    "    web_class                 string                   ,\n",
    "    web_manager               string                   ,\n",
    "    web_mkt_id                int64                       ,\n",
    "    web_mkt_class             string                   ,\n",
    "    web_mkt_desc              string                  ,\n",
    "    web_market_manager        string                   ,\n",
    "    web_company_id            int64                       ,\n",
    "    web_company_name          string                      ,\n",
    "    web_street_number         string                      ,\n",
    "    web_street_name           string                   ,\n",
    "    web_street_type           string                      ,\n",
    "    web_suite_number          string                      ,\n",
    "    web_city                  string                   ,\n",
    "    web_county                string                   ,\n",
    "    web_state                 string                       ,\n",
    "    web_zip                   string                      ,\n",
    "    web_country               string                   ,\n",
    "    web_gmt_offset            numeric                  ,\n",
    "    web_tax_percentage        numeric                  )\n",
    "\tPARTITION BY web_rec_start_date\n",
    "\tCLUSTER BY web_site_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.store_returns(\n",
    "    sr_returned_date_sk       int64                       ,\n",
    "    sr_return_time_sk         int64                       ,\n",
    "    sr_item_sk                int64               NOT NULL,\n",
    "    sr_customer_sk            int64                       ,\n",
    "    sr_cdemo_sk               int64                       ,\n",
    "    sr_hdemo_sk               int64                       ,\n",
    "    sr_addr_sk                int64                       ,\n",
    "    sr_store_sk               int64                       ,\n",
    "    sr_reason_sk              int64                       ,\n",
    "    sr_ticket_number          int64               NOT NULL,\n",
    "    sr_return_quantity        int64                       ,\n",
    "    sr_return_amt             numeric                 ,\n",
    "    sr_return_tax             numeric                  ,\n",
    "    sr_return_amt_inc_tax     numeric                  ,\n",
    "    sr_fee                    numeric                  ,\n",
    "    sr_return_ship_cost       numeric                  ,\n",
    "    sr_refunded_cash          numeric                  ,\n",
    "    sr_reversed_charge        numeric                  ,\n",
    "    sr_store_credit           numeric                 ,\n",
    "    sr_net_loss               numeric                 ,\n",
    "    empty_date                date                    )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY sr_ticket_number;\n",
    "\t\n",
    "\t\n",
    "create table tpcds_2t_flat_part_clust.household_demographics(\n",
    "    hd_demo_sk                int64               NOT NULL,\n",
    "    hd_income_band_sk         int64                       ,\n",
    "    hd_buy_potential          string                      ,\n",
    "    hd_dep_count              int64                       ,\n",
    "    hd_vehicle_count          int64                       ,\n",
    "    empty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY hd_buy_potential;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.web_page(\n",
    "    wp_web_page_sk            int64               NOT NULL,\n",
    "    wp_web_page_id            string              NOT NULL,\n",
    "    wp_rec_start_date         date                          ,\n",
    "    wp_rec_end_date           date                          ,\n",
    "    wp_creation_date_sk       int64                       ,\n",
    "    wp_access_date_sk         int64                       ,\n",
    "    wp_autogen_flag           string                       ,\n",
    "    wp_customer_sk            int64                       ,\n",
    "    wp_url                    string                      ,\n",
    "    wp_type                   string                      ,\n",
    "    wp_char_count             int64                       ,\n",
    "    wp_link_count             int64                       ,\n",
    "    wp_image_count            int64                       ,\n",
    "    wp_max_ad_count           int64                       )\n",
    "\tPARTITION BY wp_rec_start_date\n",
    "\tCLUSTER BY wp_web_page_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.promotion(\n",
    "    p_promo_sk                int64               NOT NULL,\n",
    "    p_promo_id                string              NOT NULL,\n",
    "    p_start_date_sk           int64                       ,\n",
    "    p_end_date_sk             int64                       ,\n",
    "    p_item_sk                 int64                       ,\n",
    "    p_cost                    numeric                ,\n",
    "    p_response_target         int64                       ,\n",
    "    p_promo_name              string                      ,\n",
    "    p_channel_dmail           string                       ,\n",
    "    p_channel_email           string                       ,\n",
    "    p_channel_catalog         string                       ,\n",
    "    p_channel_tv              string                       ,\n",
    "    p_channel_radio           string                       ,\n",
    "    p_channel_press           string                       ,\n",
    "    p_channel_event           string                       ,\n",
    "    p_channel_demo            string                       ,\n",
    "    p_channel_details         string                      ,\n",
    "    p_purpose                 string                      ,\n",
    "    p_discount_active         string                      ,\n",
    "\tempty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY p_promo_sk;\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.catalog_page(\n",
    "    cp_catalog_page_sk        int64               NOT NULL,\n",
    "    cp_catalog_page_id        string              NOT NULL,\n",
    "    cp_start_date_sk          int64                       ,\n",
    "    cp_end_date_sk            int64                       ,\n",
    "    cp_department             string                      ,\n",
    "    cp_catalog_number         int64                       ,\n",
    "    cp_catalog_page_number    int64                       ,\n",
    "    cp_description            string                      ,\n",
    "    cp_type                   string                      ,\n",
    "\tempty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY cp_catalog_page_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.inventory(\n",
    "    inv_date_sk               int64               NOT NULL,\n",
    "    inv_item_sk               int64               NOT NULL,\n",
    "    inv_warehouse_sk          int64               NOT NULL,\n",
    "    inv_quantity_on_hand      int64                       ,\n",
    "\tempty_date                date                        )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY inv_item_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.catalog_returns(\n",
    "    cr_returned_date_sk       int64                       ,\n",
    "    cr_returned_time_sk       int64                       ,\n",
    "    cr_item_sk                int64               NOT NULL,\n",
    "    cr_refunded_customer_sk   int64                       ,\n",
    "    cr_refunded_cdemo_sk      int64                       ,\n",
    "    cr_refunded_hdemo_sk      int64                       ,\n",
    "    cr_refunded_addr_sk       int64                       ,\n",
    "    cr_returning_customer_sk  int64                       ,\n",
    "    cr_returning_cdemo_sk     int64                       ,\n",
    "    cr_returning_hdemo_sk     int64                       ,\n",
    "    cr_returning_addr_sk      int64                       ,\n",
    "    cr_call_center_sk         int64                       ,\n",
    "    cr_catalog_page_sk        int64                       ,\n",
    "    cr_ship_mode_sk           int64                       ,\n",
    "    cr_warehouse_sk           int64                       ,\n",
    "    cr_reason_sk              int64                       ,\n",
    "    cr_order_number           int64               NOT NULL,\n",
    "    cr_return_quantity        int64                       ,\n",
    "    cr_return_amount          numeric                  ,\n",
    "    cr_return_tax             numeric                  ,\n",
    "    cr_return_amt_inc_tax     numeric                  ,\n",
    "    cr_fee                    numeric                  ,\n",
    "    cr_return_ship_cost       numeric                  ,\n",
    "    cr_refunded_cash          numeric                  ,\n",
    "    cr_reversed_charge        numeric                  ,\n",
    "    cr_store_credit           numeric                  ,\n",
    "    cr_net_loss               numeric                  ,\n",
    "\tempty_date                date                      )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY cr_item_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.web_returns(\n",
    "    wr_returned_date_sk       int64                       ,\n",
    "    wr_returned_time_sk       int64                       ,\n",
    "    wr_item_sk                int64               NOT NULL,\n",
    "    wr_refunded_customer_sk   int64                       ,\n",
    "    wr_refunded_cdemo_sk      int64                       ,\n",
    "    wr_refunded_hdemo_sk      int64                       ,\n",
    "    wr_refunded_addr_sk       int64                       ,\n",
    "    wr_returning_customer_sk  int64                       ,\n",
    "    wr_returning_cdemo_sk     int64                       ,\n",
    "    wr_returning_hdemo_sk     int64                       ,\n",
    "    wr_returning_addr_sk      int64                       ,\n",
    "    wr_web_page_sk            int64                       ,\n",
    "    wr_reason_sk              int64                       ,\n",
    "    wr_order_number           int64               NOT NULL,\n",
    "    wr_return_quantity        int64                       ,\n",
    "    wr_return_amt             numeric                  ,\n",
    "    wr_return_tax             numeric                 ,\n",
    "    wr_return_amt_inc_tax     numeric                  ,\n",
    "    wr_fee                    numeric                  ,\n",
    "    wr_return_ship_cost       numeric                  ,\n",
    "    wr_refunded_cash          numeric                  ,\n",
    "    wr_reversed_charge        numeric                  ,\n",
    "    wr_account_credit         numeric                 ,\n",
    "    wr_net_loss               numeric                  ,\n",
    "\tempty_date                date                      )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY wr_web_page_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.web_sales(\n",
    "    ws_sold_date_sk           int64                       ,\n",
    "    ws_sold_time_sk           int64                       ,\n",
    "    ws_ship_date_sk           int64                       ,\n",
    "    ws_item_sk                int64               NOT NULL,\n",
    "    ws_bill_customer_sk       int64                       ,\n",
    "    ws_bill_cdemo_sk          int64                       ,\n",
    "    ws_bill_hdemo_sk          int64                       ,\n",
    "    ws_bill_addr_sk           int64                       ,\n",
    "    ws_ship_customer_sk       int64                       ,\n",
    "    ws_ship_cdemo_sk          int64                       ,\n",
    "    ws_ship_hdemo_sk          int64                       ,\n",
    "    ws_ship_addr_sk           int64                       ,\n",
    "    ws_web_page_sk            int64                       ,\n",
    "    ws_web_site_sk            int64                       ,\n",
    "    ws_ship_mode_sk           int64                       ,\n",
    "    ws_warehouse_sk           int64                       ,\n",
    "    ws_promo_sk               int64                       ,\n",
    "    ws_order_number           int64               NOT NULL,\n",
    "    ws_quantity               int64                       ,\n",
    "    ws_wholesale_cost         numeric                  ,\n",
    "    ws_list_price             numeric                  ,\n",
    "    ws_sales_price            numeric                  ,\n",
    "    ws_ext_discount_amt       numeric                  ,\n",
    "    ws_ext_sales_price        numeric                  ,\n",
    "    ws_ext_wholesale_cost     numeric                  ,\n",
    "    ws_ext_list_price         numeric                  ,\n",
    "    ws_ext_tax                numeric                  ,\n",
    "    ws_coupon_amt             numeric                  ,\n",
    "    ws_ext_ship_cost          numeric                  ,\n",
    "    ws_net_paid               numeric                  ,\n",
    "    ws_net_paid_inc_tax       numeric                  ,\n",
    "    ws_net_paid_inc_ship      numeric                  ,\n",
    "    ws_net_paid_inc_ship_tax  numeric                  ,\n",
    "    ws_net_profit             numeric                  ,\n",
    "\tempty_date                date                     )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY ws_item_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.catalog_sales(\n",
    "    cs_sold_date_sk           int64                       ,\n",
    "    cs_sold_time_sk           int64                       ,\n",
    "    cs_ship_date_sk           int64                       ,\n",
    "    cs_bill_customer_sk       int64                       ,\n",
    "    cs_bill_cdemo_sk          int64                       ,\n",
    "    cs_bill_hdemo_sk          int64                       ,\n",
    "    cs_bill_addr_sk           int64                       ,\n",
    "    cs_ship_customer_sk       int64                       ,\n",
    "    cs_ship_cdemo_sk          int64                       ,\n",
    "    cs_ship_hdemo_sk          int64                       ,\n",
    "    cs_ship_addr_sk           int64                       ,\n",
    "    cs_call_center_sk         int64                       ,\n",
    "    cs_catalog_page_sk        int64                       ,\n",
    "    cs_ship_mode_sk           int64                       ,\n",
    "    cs_warehouse_sk           int64                       ,\n",
    "    cs_item_sk                int64               NOT NULL,\n",
    "    cs_promo_sk               int64                       ,\n",
    "    cs_order_number           int64               NOT NULL,\n",
    "    cs_quantity               int64                       ,\n",
    "    cs_wholesale_cost         numeric                  ,\n",
    "    cs_list_price             numeric                  ,\n",
    "    cs_sales_price            numeric                  ,\n",
    "    cs_ext_discount_amt       numeric                 ,\n",
    "    cs_ext_sales_price        numeric                  ,\n",
    "    cs_ext_wholesale_cost     numeric                 ,\n",
    "    cs_ext_list_price         numeric                  ,\n",
    "    cs_ext_tax                numeric                  ,\n",
    "    cs_coupon_amt             numeric                  ,\n",
    "    cs_ext_ship_cost          numeric                  ,\n",
    "    cs_net_paid               numeric                  ,\n",
    "    cs_net_paid_inc_tax       numeric                 ,\n",
    "    cs_net_paid_inc_ship      numeric                 ,\n",
    "    cs_net_paid_inc_ship_tax  numeric                  ,\n",
    "    cs_net_profit             numeric                  ,\n",
    "\tempty_date                date                     )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY cs_item_sk;\n",
    "\n",
    "\n",
    "create table tpcds_2t_flat_part_clust.store_sales(\n",
    "    ss_sold_date_sk           int64                       ,\n",
    "    ss_sold_time_sk           int64                       ,\n",
    "    ss_item_sk                int64               NOT NULL,\n",
    "    ss_customer_sk            int64                       ,\n",
    "    ss_cdemo_sk               int64                       ,\n",
    "    ss_hdemo_sk               int64                       ,\n",
    "    ss_addr_sk                int64                       ,\n",
    "    ss_store_sk               int64                       ,\n",
    "    ss_promo_sk               int64                       ,\n",
    "    ss_ticket_number          int64               NOT NULL,\n",
    "    ss_quantity               int64                       ,\n",
    "    ss_wholesale_cost         numeric                  ,\n",
    "    ss_list_price             numeric                  ,\n",
    "    ss_sales_price            numeric                  ,\n",
    "    ss_ext_discount_amt       numeric                  ,\n",
    "    ss_ext_sales_price        numeric                  ,\n",
    "    ss_ext_wholesale_cost     numeric                  ,\n",
    "    ss_ext_list_price         numeric                  ,\n",
    "    ss_ext_tax                numeric                  ,\n",
    "    ss_coupon_amt             numeric                  ,\n",
    "    ss_net_paid               numeric                  ,\n",
    "    ss_net_paid_inc_tax       numeric                  ,\n",
    "    ss_net_profit             numeric                  ,\n",
    "\tempty_date                date                     )\n",
    "\tPARTITION BY empty_date\n",
    "\tCLUSTER BY ss_item_sk"
   ]
  },
  {
   "source": [
    "### Ingesting data into partitioned tables\n",
    "\n",
    "We could simply `INSERT INTO` the billions of records from our baseline dataset into our new partitioned dataset but that would take quite a while (a few hours). Instead, we'll simply use the new [BigQuery Dataset Copy API (beta)](https://cloud.google.com/bigquery/docs/copying-datasets) to populate the tables from an already existing solution in our `qwiklabs-resources` project.\n",
    "\n",
    "### Use the BigQuery Data Transfer Service to copy an existing dataset\n",
    "\n",
    "1. Enable the [BigQuery Data Transfer Service API](https://console.cloud.google.com/apis/library/bigquerydatatransfer.googleapis.com)\n",
    "2. Navigate to the [BigQuery console and the existing `qwiklabs-resources` dataset](https://console.cloud.google.com/bigquery?project=qwiklabs-resources&p=qwiklabs-resources&d=tpcds_2t_baseline&page=dataset)\n",
    "\n",
    "![img/open-dataset.png](img/open-dataset.png)\n",
    "\n",
    "3. Click Copy Dataset\n",
    "\n",
    "![img/copy-dataset-2.png](img/copy-dataset-2.png)\n",
    "\n",
    "4. In the pop-up, choose your __project name__ and make sure to provide the same dataset name\n",
    "\n",
    "![img/copy-dataset-modal.png](img/copy-dataset-modal.png)\n",
    "\n",
    "5. Click __Copy__\n",
    "\n",
    "6. Wait for the transfer to complete\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Verify you now have the baseline data in your project\n",
    "\n",
    "Run the below query and confirm you see data. Note that if you omit the `project-id` ahead of the dataset name in the `FROM` clause, BigQuery will assume your default project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>store_transaction_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>5762820700</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   store_transaction_count\n",
       "0               5762820700"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT COUNT(*) AS store_transaction_count\n",
    "FROM tpcds_2t_flat_part_clust.store_sales"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running a few benchmark queries with a shell script"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# runs the SQL queries from the TPCDS benchmark \n",
    "\n",
    "# Pull the current Google Cloud Platform project name\n",
    "export PROJECT=$(gcloud config list project --format \"value(core.project)\")\n",
    "\n",
    "BQ_DATASET=\"tpcds_2t_flat_part_clust\" # let's benchmark our new dataset\n",
    "QUERY_FILE_PATH=\"/home/jupyter/$PROJECT/02_add_partition_and_clustering/solution/sql/full_performance_benchmark.sql\" #sample_benchmark.sql\n",
    "IFS=\";\"\n",
    "\n",
    "# create perf table to keep track of run times for all 99 queries\n",
    "printf \"\\033[32;1m Housekeeping tasks... \\033[0m\\n\\n\";\n",
    "printf \"Creating a reporting table perf to track how fast each query runs...\";\n",
    "perf_table_ddl=\"CREATE TABLE IF NOT EXISTS $BQ_DATASET.perf(performance_test_num int64, query_num int64, elapsed_time_sec int64, ran_on int64)\"\n",
    "bq rm -f $BQ_DATASET.perf\n",
    "bq query --nouse_legacy_sql $perf_table_ddl \n",
    "\n",
    "start=$(date +%s)\n",
    "index=0\n",
    "for select_stmt in $(<$QUERY_FILE_PATH)　\n",
    "do \n",
    "  # run the test until you hit a line with the string 'END OF BENCHMARK' in the file\n",
    "  if [[ \"$select_stmt\" == *'END OF BENCHMARK'* ]]; then\n",
    "    break\n",
    "  fi\n",
    "\n",
    "  printf \"\\n\\033[32;1m Let's benchmark this query... \\033[0m\\n\";\n",
    "  printf \"$select_stmt\";\n",
    "  \n",
    "  SECONDS=0;\n",
    "  bq query --use_cache=false --nouse_legacy_sql $select_stmt # critical to turn cache off for this test\n",
    "  duration=$SECONDS\n",
    "\n",
    "  # get current timestamp in milliseconds  \n",
    "  ran_on=$(date +%s)\n",
    "\n",
    "  index=$((index+1))\n",
    "\n",
    "  printf \"\\n\\033[32;1m Here's how long it took... \\033[0m\\n\\n\";\n",
    "  echo \"Query $index ran in $(($duration / 60)) minutes and $(($duration % 60)) seconds.\"\n",
    "\n",
    "  printf \"\\n\\033[32;1m Writing to our benchmark table... \\033[0m\\n\\n\";\n",
    "  insert_stmt=\"insert into $BQ_DATASET.perf(performance_test_num, query_num, elapsed_time_sec, ran_on) values($start, $index, $duration, $ran_on)\"\n",
    "  printf \"$insert_stmt\"\n",
    "  bq query --nouse_legacy_sql $insert_stmt\n",
    "done\n",
    "\n",
    "end=$(date +%s)\n",
    "\n",
    "printf \"Benchmark test complete\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Benchmarking all 99 queries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>performance_test_num</th>\n",
       "      <th>query_num</th>\n",
       "      <th>elapsed_time_sec</th>\n",
       "      <th>ran_on</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>1571036800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>1571036810</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>3</td>\n",
       "      <td>67</td>\n",
       "      <td>1571036881</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "      <td>1571036891</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>1571036900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>95</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>96</td>\n",
       "      <td>8</td>\n",
       "      <td>1571040738</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>96</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>97</td>\n",
       "      <td>25</td>\n",
       "      <td>1571040767</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>97</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>98</td>\n",
       "      <td>25</td>\n",
       "      <td>1571040795</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>98</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>99</td>\n",
       "      <td>25</td>\n",
       "      <td>1571040823</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>99</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>100</td>\n",
       "      <td>1</td>\n",
       "      <td>1571040828</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    performance_test_num  query_num  elapsed_time_sec      ran_on\n",
       "0             1571036796          1                 4  1571036800\n",
       "1             1571036796          2                 7  1571036810\n",
       "2             1571036796          3                67  1571036881\n",
       "3             1571036796          4                 6  1571036891\n",
       "4             1571036796          5                 6  1571036900\n",
       "..                   ...        ...               ...         ...\n",
       "95            1571036796         96                 8  1571040738\n",
       "96            1571036796         97                25  1571040767\n",
       "97            1571036796         98                25  1571040795\n",
       "98            1571036796         99                25  1571040823\n",
       "99            1571036796        100                 1  1571040828\n",
       "\n",
       "[100 rows x 4 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT * FROM `qwiklabs-resources.tpcds_2t_flat_part_clust.perf` # public table\n",
    "WHERE \n",
    " # Let's only pull the results from our most recent test\n",
    " performance_test_num = (SELECT MAX(performance_test_num) FROM `qwiklabs-resources.tpcds_2t_flat_part_clust.perf`)\n",
    "ORDER BY ran_on"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>test_date</th>\n",
       "      <th>latest_performance_test_num</th>\n",
       "      <th>count_queries_benchmarked</th>\n",
       "      <th>total_time_sec</th>\n",
       "      <th>fastest_query_time_sec</th>\n",
       "      <th>slowest_query_time_sec</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>2019-10-14 07:06:36+00:00</td>\n",
       "      <td>1571036796</td>\n",
       "      <td>100</td>\n",
       "      <td>3680</td>\n",
       "      <td>1</td>\n",
       "      <td>166</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  test_date  latest_performance_test_num  \\\n",
       "0 2019-10-14 07:06:36+00:00                   1571036796   \n",
       "\n",
       "   count_queries_benchmarked  total_time_sec  fastest_query_time_sec  \\\n",
       "0                        100            3680                       1   \n",
       "\n",
       "   slowest_query_time_sec  \n",
       "0                     166  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "  TIMESTAMP_SECONDS(MAX(performance_test_num)) AS test_date,\n",
    "  COUNT(DISTINCT query_num) AS count_queries_benchmarked,\n",
    "  SUM(elapsed_time_sec) AS total_time_sec,\n",
    "  ROUND(SUM(elapsed_time_sec)/60,2) AS total_time_min,\n",
    "  MIN(elapsed_time_sec) AS fastest_query_time_sec,\n",
    "  MAX(elapsed_time_sec) AS slowest_query_time_sec,\n",
    "  ROUND(AVG(elapsed_time_sec),2) AS avg_query_time_sec\n",
    "FROM\n",
    "  `qwiklabs-resources.tpcds_2t_flat_part_clust.perf`\n",
    "WHERE\n",
    "  performance_test_num = (SELECT MAX(performance_test_num) FROM `qwiklabs-resources.tpcds_2t_flat_part_clust.perf`)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "\n",
    "The total time for the benchmark queries on our newly partitioned dataset is 1307 seconds or 22 minutes. (That's 24% faster than 29 minutes for the baseline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare vs baseline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using SQL we can compare our benchmark tests pretty easily."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery\n",
    "# TODO write where clause filters to pull latest performance from each table (and debug why they keep getting truncated)\n",
    "WITH \n",
    "add_part AS (\n",
    "SELECT * FROM `qwiklabs-resources.tpcds_2t_flat_part_clust.perf`)\n",
    "\n",
    ", base AS (\n",
    "SELECT * FROM `qwiklabs-resources.tpcds_2t_baseline.perf` )\n",
    "\n",
    "SELECT \n",
    "  base.query_num,\n",
    "  base.elapsed_time_sec AS elapsed_time_sec_base,\n",
    "  add_part.elapsed_time_sec AS elapsed_time_sec_add_part,\n",
    "  add_part.elapsed_time_sec - base.elapsed_time_sec AS delta,\n",
    "  ROUND((-100)*(add_part.elapsed_time_sec - base.elapsed_time_sec)/base.elapsed_time_sec,2) AS percent_improvement\n",
    "FROM base JOIN add_part USING(query_num)\n",
    "ORDER BY delta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Final Activity:__ Create a Data Studio report or ipynb visualization showing the differences between performance. Which queries saw the most improvement?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
